Linear Mixed Models : theory and application to data in cognitive science

Module 2 : Introducing the shotgun

     

Gabriel Weindel, PhD, April 2021 at AMU

"encouraging psycholinguists to use linear mixed-effects models [is] like giving shotguns to toddlers" (prominent psycholinguist quoted by Barr et al., 2013)

Benefits of LMMs

Costs of LMMs

Organization of the second module

  1. Maximum likelihood estimation
  2. What are hierarchies in the data
  3. Formalism behind LMM
  4. Illustrating the concept of random effects
  5. Adding predictors
  6. Full LMM

Maximum likelihood estimation

This section is intended to give you an intuition about MLE; see Myung (2003) for an accessible tutorial

We are going to flip coins to illustrate the concept

What is the probability of $x$ heads given $n$ coin flips with head probability $p$ ?

$$ P(x|n,p) = \frac{n!}{x!(n-x)!} \cdot p^x(1 - p)^{n-x} $$

If I come up with 28 heads out of 40 flips:

Under the assumption that the coin is unbiased, i.e. $p=0.5$ :

\begin{equation} P(28|40,.5) = \frac{40!}{28!(40-28)!} \cdot .5^{28}(1 - .5)^{40-28} \approx .005 \end{equation}

with $p=0.7$ \begin{equation} P(28|40,.7) = \frac{40!}{28!(40-28)!} \cdot .70^{28}(1 - .7)^{40-28} \approx .137 \end{equation}

Given these results, which is more likely : an unbiased coin or a coin biased towards heads ?
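The two likelihoods above can be computed directly in R with `dbinom` (a minimal sketch of the calculation):

```r
# Likelihood of 28 heads in 40 flips under two candidate values of p
dbinom(28, size = 40, prob = 0.5)  # ~ 0.005
dbinom(28, size = 40, prob = 0.7)  # ~ 0.137
# For a binomial, the MLE of p is the observed proportion: 28/40 = 0.7
```
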

Now, moving to continuous DVs, the same logic applies, except that we are no longer looking at probability mass functions but at probability density functions, e.g. the density of a normal distribution :

$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{\!2}\,\right) $$

We are looking for the parameter values (e.g. $\mu$) that maximise the likelihood of the data. E.g. :

vs.

This, obviously in a more complex way, is the method under the hood of parameter estimation in the LMMs we are going to study : we optimize (maximize) the likelihood function of our models.
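As a toy example of this optimization (simulated data; the true $\mu$ and $\sigma$ are assumptions for illustration), we can recover $\mu$ by minimising the negative log-likelihood:

```r
set.seed(123)
x <- rnorm(100, mean = 5, sd = 2)  # simulated data with known mu = 5 (assumed example)

# Negative log-likelihood of the data as a function of mu (sd held fixed for simplicity)
negll <- function(mu) -sum(dnorm(x, mean = mu, sd = 2, log = TRUE))

mu_hat <- optimize(negll, interval = c(0, 10))$minimum
mu_hat  # equals the sample mean, the analytical MLE for mu
```

With $\sigma$ fixed, the maximum-likelihood estimate of $\mu$ is exactly the sample mean, so `mu_hat` should match `mean(x)` up to the optimizer's tolerance.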

What are hierarchies in the data ?

Distributions of parameters of a distribution

Let's create a data set with hierarchies

Imagine a WM task where we sampled $x$ WM scores per participant $\times$ condition (a within-subject, repeated-measures design)

We want to create participants who show inter-individual variability, so we define a distribution over the parameter $\mu$ such that each individual receives a unique $\mu$ parameter

The distribution we will sample from is given below :

Now we create a for loop where we generate up to 100 trials for each of 15 participants; each run draws a random value for the $\mu$ parameter and a random number of trials (e.g. because of a priori rejection criteria, fatigue, ...)
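A minimal sketch of such a loop (the hyperparameters of the $\mu$ distribution, the 50-100 trial range, and the noise SD are illustrative assumptions):

```r
set.seed(42)
dat <- data.frame()
for (p in 1:15) {
  mu_p <- rnorm(1, mean = 50, sd = 5)  # participant-specific mean (assumed hyperparameters)
  n_p  <- sample(50:100, 1)            # random number of trials kept for this participant
  dat  <- rbind(dat,
                data.frame(participant = factor(p),
                           wm = rnorm(n_p, mean = mu_p, sd = 3)))
}
```
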

We can look at the number of trials we sampled

We can take a look at the distribution of WM for all the participants

or for each participant

Adding experimental levels

We will now add an experimental condition. E.g. imagine we have a perfect longitudinal study on age and working memory : we tested each individual again at different ages

As a starting point, we pre-average the data

A simple LM :

To respect assumptions of LM we need i.i.d. data :
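A sketch of the pre-averaging step and the simple LM (simulated data; the variable names `wm` and `age` and the effect sizes are assumptions):

```r
set.seed(7)
# Simulated trial-level data: each participant tested at several ages
d <- expand.grid(participant = factor(1:15), age = c(20, 30, 40, 50), trial = 1:30)
d$wm <- 55 - 0.1 * d$age +                 # assumed negative effect of age
        rnorm(15, 0, 4)[d$participant] +   # participant-specific shift
        rnorm(nrow(d), 0, 3)               # trial-level noise

# Pre-average over trials so each participant contributes one mean per age
agg <- aggregate(wm ~ participant + age, data = d, FUN = mean)
fit_lm <- lm(wm ~ age, data = agg)
coef(fit_lm)  # slope roughly recovers the simulated -0.1
```

Note that even after pre-averaging, the repeated measures per participant mean the observations are not truly i.i.d., which is exactly what motivates the move to LMMs.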

Formalism behind LMM

Mixed models are called mixed because they include both fixed effects and random effects (the terminology is confusingly varied : multilevel, hierarchical, random, ...)

LMM are extensions of LM :

$y_{ji} \sim \mathcal{N}(\mu_j, \sigma)$

But this time :

$\mu_j = \alpha_j + \beta_j IV$

$\alpha_j \sim \mathcal{N}(\mu_\alpha, \sigma_\alpha)$

$\beta_j \sim \mathcal{N}(\mu_\beta, \sigma_\beta)$

As in module 1, adding predictors just increases the number of $\beta_j$


Illustrating the concept of random effects

We are now going to use (one of) the R packages that allow fitting mixed models : lme4

Next we illustrate the notion of random intercept vs random slopes vs full random effects (R code used was largely inspired by : http://mfviz.com/hierarchical-models/ )

A model with random effects on the intercept ($\alpha_j$)

A model with random effects on the slope ($\beta_j$)

A model with random effects on the intercept and on the slope ($\alpha_j$ and $\beta_j$)
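In lme4 syntax, these three random-effect structures can be written as follows (simulated data; the variable names and parameter values are assumptions, and the simulated data contain no slope variability, so singular-fit warnings are possible):

```r
library(lme4)
set.seed(1)
# Simulated longitudinal data: one wm mean per participant x age
d <- expand.grid(participant = factor(1:15), age = c(20, 30, 40, 50))
d$wm <- 55 + rnorm(15, 0, 4)[d$participant] - 0.1 * d$age + rnorm(nrow(d), 0, 2)

m_int   <- lmer(wm ~ age + (1 | participant), data = d)        # random intercepts only
m_slope <- lmer(wm ~ age + (0 + age | participant), data = d)  # random slopes only
m_both  <- lmer(wm ~ age + (1 | participant) + (0 + age | participant), data = d)
# (1 + age | participant) would additionally estimate the intercept-slope correlation
```
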

Illustrating shrinkage

We will compare parameter estimates from one lm() per participant vs. the partial pooling performed by LMMs

To go further on the notions of complete pooling (one LM) vs. partial pooling (LMM) vs. no pooling (one LM per participant), see the blog post by T. Mahr
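A minimal sketch of the no-pooling vs. partial-pooling comparison (simulated data; all parameter values are assumptions): the per-participant lm() means are more spread out than the lmer() estimates, which are shrunk towards the grand mean.

```r
library(lme4)
set.seed(2)
d <- data.frame(participant = factor(rep(1:15, each = 20)))
d$wm <- rnorm(15, mean = 50, sd = 5)[d$participant] + rnorm(nrow(d), 0, 8)

# No pooling: one independent mean per participant
no_pool <- coef(lm(wm ~ 0 + participant, data = d))
# Partial pooling: participant intercepts shrunk towards the grand mean
part_pool <- coef(lmer(wm ~ 1 + (1 | participant), data = d))$participant[, 1]

sd(no_pool); sd(part_pool)  # the partially pooled estimates have smaller spread
```
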

Inspecting and comparing the outputs

"Significance" testing

When focusing on significance testing (but also when the focus is on estimation) you should carefully think about the contrasts you use for your factors (cf. module 1).

e.g. :

How do I assess significance of my predictors ?

A full random-effect structure does not necessarily imply higher power, but it does guard (as well as possible) against Type I errors; see Barr et al., 2013

Variance/covariance matrix

Note that the last model, including two random effects, did not estimate a correlation between them. However, such a correlation is often plausible (e.g. participants with higher scores/intercepts might be less subject to age decline/slopes) and we would like to estimate it.

This is best represented by a variance-covariance matrix :

$$\begin{pmatrix} \text{Var(Intercept)} & \text{Cov(Intercept,Slope)} \\ \text{Cov(Intercept,Slope)} & \text{Var(Slope)} \end{pmatrix}$$

Let's recreate the data with a defined var-covariance matrix : $$\begin{pmatrix} 1 & 0.66 \\ 0.66 & 2 \end{pmatrix}$$
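One way to draw correlated random intercepts and slopes with this target matrix is `MASS::mvrnorm` (a sketch; using 15 participants is an assumption):

```r
library(MASS)
set.seed(3)
Sigma <- matrix(c(1, 0.66, 0.66, 2), nrow = 2)  # target variance-covariance matrix
re <- mvrnorm(15, mu = c(0, 0), Sigma = Sigma)  # one intercept/slope pair per participant
cov(re)  # with only 15 participants, this deviates noticeably from Sigma
```
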

In our model, the estimated covariance matrix is : $$\begin{pmatrix} 1.17 & 1.32 \\ 1.32 & 3.42 \end{pmatrix}$$

Note that the random-effect variances and covariances are difficult to estimate with small sample sizes (as is always the case, by the way; e.g. for correlations see the blog post by G. Rousselet)

Adding another predictor

We get a convergence warning : the maximum likelihood optimizer is having trouble converging (when you get convergence problems, see this page).

In our case we can simply change the optimizer
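With lme4, the optimizer is changed through `lmerControl`, e.g. switching to bobyqa (a sketch on simulated data; variable names are assumptions):

```r
library(lme4)
set.seed(4)
d <- expand.grid(participant = factor(1:15), age = c(20, 30, 40, 50))
d$wm <- 55 + rnorm(15, 0, 4)[d$participant] - 0.1 * d$age + rnorm(nrow(d), 0, 2)

# Refit the correlated intercept + slope model with the bobyqa optimizer
m <- lmer(wm ~ age + (1 + age | participant), data = d,
          control = lmerControl(optimizer = "bobyqa"))
# allFit(m) refits the model with every available optimizer to compare the results
```
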

But warnings can be informative about the structure of the LMM we are trying to fit, e.g. we could have more parameters than observations

Another issue is the number of units in your random effects : the higher the number, the better the estimation of the variances (and covariances !) of the random effects

Plotting the multiple predictor model

Now checking model assumptions

Fitting a LMM at the trial level

Here we build a linear mixed model with all fixed effects (simple effects and interaction) and all random effects (intercept and all slopes) at the trial level (without pre-averaging)

-> We use the model with the maximum amount of information, where averaging is part of the model
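In lme4 syntax, such a trial-level model could look like this (simulated data; the variable names, the factor `cond`, and the effect sizes are assumptions):

```r
library(lme4)
set.seed(5)
# Trial-level data: participant x age x condition, 20 trials per cell
d <- expand.grid(participant = factor(1:15), age = c(20, 40),
                 cond = factor(c("A", "B")), trial = 1:20)
d$wm <- 55 - 0.1 * d$age + 2 * (d$cond == "B") +
        rnorm(15, 0, 4)[d$participant] + rnorm(nrow(d), 0, 3)

# All fixed effects (including the interaction) and the maximal random structure;
# with few participants this may trigger convergence or singular-fit warnings
m_full <- lmer(wm ~ age * cond + (1 + age * cond | participant), data = d)
```
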

And what about the assumptions of our model ?

Other sources of randomness

A crossed random factors mixed model, e.g. if each trial used a different WM test/stimulus
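A sketch of a crossed random factors model, with items (stimuli) crossed with participants (simulated data; the names `item` and `wm` and the variance components are assumptions):

```r
library(lme4)
set.seed(6)
# Every participant sees every item: participants and items are crossed
d <- expand.grid(participant = factor(1:15), item = factor(1:30))
d$wm <- 50 + rnorm(15, 0, 4)[d$participant] +  # participant variability
        rnorm(30, 0, 2)[d$item] +              # item variability
        rnorm(nrow(d), 0, 3)                   # residual noise

m_cross <- lmer(wm ~ 1 + (1 | participant) + (1 | item), data = d)
```
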

Final words for Module 2

Practical advice on LMMs :

If you want to learn more about simulation, mixed models and e.g. power calculation see DeBruine and Barr 2021

When using mixed models it is advocated to always keep the random structure maximal (Barr et al., 2013), but with real data we often run into computational problems during estimation. Other languages (e.g. Julia) ? Bayesian estimation ?

Moving to module 3 after some practical exercises on the data you brought